⚡️ Speed up function merge_out_layout_with_ocr_layout by 30%
#4212
+21
−23
📄 30% (0.30x) speedup for `merge_out_layout_with_ocr_layout` in `unstructured/partition/pdf_image/ocr.py`
⏱️ Runtime: 329 milliseconds → 252 milliseconds (best of 5 runs)
📝 Explanation and details
The optimized code achieves a 30% speedup through three changes in `aggregate_embedded_text_by_block` and `supplement_layout_with_ocr_elements`:

Key Optimizations

1. Replaced `.sum(axis=1).astype(bool)` with `.any(axis=1)`
This change appears in both functions when computing boolean masks from the result of `bboxes1_is_almost_subregion_of_bboxes2()`.
Why it's faster:
- `.sum(axis=1)` creates an intermediate integer array by counting True values across columns, then converts it to boolean.
- `.any(axis=1)` short-circuits on the first True value per row, avoiding the full summation.
- The `.astype(bool)` conversion overhead is eliminated.
Performance impact: Based on the line profiler, the mask computation in `aggregate_embedded_text_by_block` dropped from ~234ms to ~222ms (5% faster), and the overall function improved from 551ms to 443ms (19.6% faster).

2. Avoided redundant slicing operations
In `aggregate_embedded_text_by_block`, the optimized code stores `sliced = source_regions.slice(mask)` once and reuses it, instead of calling `source_regions.slice(mask)` three separate times.
Why it's faster:
- Each `slice()` operation creates a new object with coordinate and text array copies, so slicing once avoids repeating those copies.

3. Early exit with `mask.any()`
The optimized code checks `if mask.any():` before processing, which avoids unnecessary work when no regions match.
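The three changes above can be illustrated with a minimal sketch. It is not the code from `ocr.py`: the `Regions` class, `mask_with_sum`, `mask_with_any`, and `aggregate_text` are hypothetical stand-ins, and the reading-order sort is only an assumption added to give the sliced result more than one use. The sketch shows that `.any(axis=1)` yields the same mask as `.sum(axis=1).astype(bool)`, that the slice is taken once and reused, and that an empty mask returns before any slicing happens.

```python
import numpy as np


class Regions:
    """Hypothetical stand-in for a layout container; not the library's class."""

    def __init__(self, coords, texts):
        self.coords = np.asarray(coords, dtype=float).reshape(-1, 4)
        self.texts = np.asarray(texts, dtype=object)
        self.slice_calls = 0  # counts how many times slice() copied data

    def slice(self, mask):
        self.slice_calls += 1
        # Boolean indexing copies the selected rows of both arrays.
        return Regions(self.coords[mask], self.texts[mask])


def mask_with_sum(is_subregion):
    # Original pattern: count True values per row, then cast the counts to bool.
    return is_subregion.sum(axis=1).astype(bool)


def mask_with_any(is_subregion):
    # Optimized pattern: reduce each row with logical OR directly.
    return is_subregion.any(axis=1)


def aggregate_text(source, mask):
    # Early exit: skip slicing entirely when no region matches.
    if not mask.any():
        return ""
    # Slice once and reuse the result for both coordinates and texts.
    sliced = source.slice(mask)
    # Assumed ordering for illustration: top-to-bottom (y1), then left-to-right (x1).
    order = np.lexsort((sliced.coords[:, 0], sliced.coords[:, 1]))
    return " ".join(sliced.texts[order])


if __name__ == "__main__":
    is_subregion = np.array(
        [[True, False, True], [False, False, False], [False, True, False]]
    )
    source = Regions(
        [[0, 0, 1, 1], [5, 5, 6, 6], [0, 2, 1, 3]], ["Hello", "skip", "world"]
    )
    mask = mask_with_any(is_subregion)
    print(np.array_equal(mask, mask_with_sum(is_subregion)))
    print(aggregate_text(source, mask), source.slice_calls)
```

Because both mask functions return identical boolean arrays, swapping one for the other should not change behavior, only the amount of intermediate work.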
Impact Based on Test Results
The optimization is particularly effective for workloads with:
- Many elements requiring text aggregation (10-41% speedup on tests with 100-500 elements)
  - `test_large_scale_many_elements_aggregated`: 77ms → 67.2ms (14.6% faster)
  - `test_merge_large_number_of_elements`: 43.8ms → 31.0ms (41.3% faster)
  - `test_merge_boundary_coordinates_large_scale`: 87.3ms → 61.5ms (41.8% faster)
- Documents with invalid text patterns (10-20% speedup)
  - `test_invalid_texts_are_replaced`: 250μs → 222μs (12.4% faster)
  - `test_merge_with_all_invalid_text`: 654μs → 554μs (18.1% faster)
- Complex spatial matching scenarios (33-36% speedup)
  - `test_merge_with_overlapping_elements`: 25.2ms → 18.9ms (33.3% faster)
  - `test_merge_with_varied_subregion_thresholds`: 78.6ms → 57.6ms (36.4% faster)

Context Impact
The function `merge_out_layout_with_ocr_layout` is called from `supplement_page_layout_with_ocr` in OCR processing hot paths, specifically when `ocr_mode == OCRMode.FULL_PAGE`. Each processed page invokes this function once, so the 30% speedup translates directly into faster document processing throughput for PDF/image partitioning workflows.

✅ Correctness verification report:
⚙️ Existing Unit Tests
- `partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_code`
- `partition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout`
🌀 Generated Regression Tests

To edit these changes, `git checkout codeflash/optimize-merge_out_layout_with_ocr_layout-mkrn264u` and push.